13 research outputs found

    An unsupervised classification scheme for improving predictions of prokaryotic TIS

    BACKGROUND: Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes. RESULTS: We introduce a clustering algorithm for completely unsupervised scoring of potential TIS, based on positionally smoothed probability matrices. The algorithm requires an initial gene prediction and the genomic sequence of the organism to perform the reannotation. As compared with other methods for improving predictions of gene starts in bacterial genomes, our approach is not based on any specific assumptions about prokaryotic TIS. Despite the generality of the underlying algorithm, the prediction rate of our method is competitive on experimentally verified test data from E. coli and B. subtilis. Regarding genomes with high G+C content, in contrast to some previously proposed methods, our algorithm also provides good performance on P. aeruginosa, B. pseudomallei and R. solanacearum. CONCLUSION: On reliable test data we showed that our method provides good results in post-processing the predictions of the widely-used program GLIMMER. The underlying clustering algorithm is robust with respect to variations in the initial TIS annotation and does not require specific assumptions about prokaryotic gene starts. These features are particularly useful on genomes with high G+C content. 
The algorithm has been implemented in the tool »TICO« (TIs COrrector), which is publicly available from our web site.
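One way to read "positionally smoothed probability matrices" is as position-specific base frequencies whose counts are shared between neighbouring positions. The Python sketch below illustrates that idea only; it is not the TICO implementation. The Gaussian weighting, the `sigma` and `pseudocount` parameters, and the function name are all assumptions introduced here.

```python
import math

ALPHABET = "ACGT"

def smoothed_probability_matrix(windows, sigma=1.0, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length windows.

    Counts at each position are blended with counts at nearby positions
    using a Gaussian weight (an assumed smoothing scheme, for illustration).
    """
    length = len(windows[0])
    counts = [[0.0] * 4 for _ in range(length)]
    for window in windows:
        for pos, base in enumerate(window):
            if base in ALPHABET:  # ignore ambiguous bases such as N
                counts[pos][ALPHABET.index(base)] += 1.0

    matrix = []
    for pos in range(length):
        row = [pseudocount] * 4
        for other in range(length):
            weight = math.exp(-((pos - other) ** 2) / (2.0 * sigma ** 2))
            for b in range(4):
                row[b] += weight * counts[other][b]
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix
```

A small `sigma` keeps each position nearly independent, while a large `sigma` lets a signal such as a ribosome binding site contribute to the scores of adjacent positions.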

    Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites

    BACKGROUND: Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals. RESULTS: We propose a kernel-based approach to datamining on biological sequences. With our method it is possible to model and analyze positional variability of oligomers of any length in a natural way. On the one hand, this is achieved by mapping the sequences to an intuitive but high-dimensional feature space, well-suited for interpretation of the learnt models. On the other hand, by means of the kernel trick we can provide a general learning algorithm for that high-dimensional representation because all required statistics can be computed without performing an explicit feature space mapping of the sequences. By introducing a kernel parameter that controls the degree of position-dependency, our feature space representation can be tailored to the characteristics of the biological problem at hand. A regularized learning scheme enables application even to biological problems for which only small sets of example sequences are available. Our approach includes a visualization method for transparent representation of characteristic sequence features. The importance of features can thereby be measured in terms of discriminative strength with respect to classification of the underlying sequences. To demonstrate and validate our concept on a biochemically well-defined case, we analyze E. coli translation initiation sites in order to show that we can find biologically relevant signals. For that case, our results clearly show that the Shine-Dalgarno sequence is the most important signal upstream of a start codon. The variability in position and composition we found for that signal is in accordance with previous biological knowledge. We also find evidence for signals downstream of the start codon, previously introduced as transcriptional enhancers. These signals are mainly characterized by occurrences of adenine in a region of about 4 nucleotides next to the start codon. CONCLUSIONS: We showed that the oligo kernel can provide a valuable tool for the analysis of relevant signals in biological sequences. In the case of translation initiation sites we could clearly deduce the most discriminative motifs and their positional variation from example sequences. Attractive features of our approach are its flexibility with respect to oligomer length and position conservation. By means of these two parameters oligo kernels can easily be adapted to different biological problems.
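The two parameters named in the abstract, oligomer length and degree of position-dependency, can be made concrete with a small Python sketch. It assumes that position-dependency is controlled by a Gaussian width `sigma` applied to the distance between matching oligomer occurrences, and it omits any constant scaling factor. The function names are hypothetical, and this is an illustration rather than the authors' exact formulation.

```python
import math
from collections import defaultdict

def oligo_positions(seq, k):
    """Map each oligomer of length k to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions

def oligo_kernel(s, t, k=3, sigma=1.0):
    """Similarity of two sequences based on shared oligomers.

    Every pair of occurrences of the same oligomer contributes a Gaussian
    term that decays with the distance between the two positions.
    """
    pos_s = oligo_positions(s, k)
    pos_t = oligo_positions(t, k)
    total = 0.0
    for oligo, starts in pos_s.items():
        for p in starts:
            for q in pos_t.get(oligo, ()):
                total += math.exp(-((p - q) ** 2) / (4.0 * sigma ** 2))
    return total
```

With `sigma` close to zero only oligomers at identical positions count, so the kernel behaves like a strictly position-specific comparison. With a very large `sigma` position is ignored and the value approaches a product of oligomer counts.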

    Gene prediction in metagenomic fragments: A large scale machine learning approach

    BACKGROUND: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that are different from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions. RESULTS: We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability. CONCLUSION: Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
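The two-stage structure described in this abstract can be sketched in Python as follows. Stage one reduces a codon-usage vector to a single score with a linear discriminant. Stage two feeds such scores, together with ORF length and GC content, through a small neural network that outputs a probability. Everything specific here is an assumption for illustration: the function names, the single hidden layer with `tanh` units, and the idea that inputs are pre-scaled. Any weights passed in are placeholders, not trained values from the paper.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(orf):
    """Relative frequency of each of the 64 codons in an in-frame ORF."""
    counts = dict.fromkeys(CODONS, 0)
    n = 0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases
            counts[codon] += 1
            n += 1
    return [counts[c] / n if n else 0.0 for c in CODONS]

def linear_discriminant(weights, features, bias=0.0):
    """Stage one: project a feature vector onto a weight vector."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def coding_probability(stage_one_scores, orf_length, gc_content, hidden, output):
    """Stage two: one hidden layer combines scores, ORF length and GC content.

    `hidden` is a list of (weights, bias) pairs, `output` a single
    (weights, bias) pair. Inputs are assumed to be scaled to a small range.
    """
    inputs = list(stage_one_scores) + [orf_length, gc_content]
    activations = [math.tanh(linear_discriminant(w, inputs, b)) for w, b in hidden]
    out_weights, out_bias = output
    return sigmoid(linear_discriminant(out_weights, activations, out_bias))
```

In a fuller version the dicodon and translation initiation site discriminants would contribute further entries to `stage_one_scores`; only the monocodon feature extraction is spelled out here.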

    Analysis of translation initiation sites in prokaryotic genomes with machine learning methods

    The exact localization of prokaryotic translation initiation sites with automated prediction systems is still not completely solved. In this context, two approaches from the field of machine learning have been developed: the Oligo Kernel algorithm, a supervised learning method for the analysis of signals in biological sequences, and TICO (Translation Initiation site COrrection), a tool for (re-)annotation of translation initiation sites with an unsupervised classification scheme. It is shown that the Oligo Kernel algorithm is well suited to the analysis and identification of biological signals. In a case study on translation initiation sites of the eubacterium Escherichia coli K-12, the high performance of the Oligo classifier is demonstrated on experimentally verified data. A visualization of the discriminative signals facilitates a biologically meaningful interpretation. For E. coli K-12, commonly known signals for translation initiation and their inherent variability can be clearly identified. Since the algorithm is flexible regarding the length of the oligomers considered and the degree of positional smoothing, it can be adapted to the analysis of other biological signals. The program TICO significantly improves the prediction of prokaryotic translation initiation sites as compared to previous approaches by post-processing an initial gene annotation, such as one obtained with a classical gene finder. The improvement achieved by such a reannotation amounts to up to 30%. The algorithm is robust and provides a visualization method allowing an intuitive presentation of the discriminative features. The program can be accessed through a web interface and is freely available as a command line tool for Linux and Windows.

    YACOP: Enhanced gene prediction obtained by a combination of existing methods

    The performance of gene-predicting tools varies considerably when evaluated with respect to sensitivity and specificity or to their capability to identify the correct start codon. We set out to validate tools for gene prediction and to implement a metatool named YACOP, which combines existing tools and achieves a higher performance. YACOP parses and combines the output of the three gene-predicting systems Critica, Glimmer and ZCURVE. It outperforms each of the programs tested, with high sensitivity and specificity values combined with a larger number of correctly predicted gene starts. The performance of YACOP and the gene-finding programs was tested by comparing their output with a carefully selected set of annotated genomes. We found that the problem of identifying genes in prokaryotic genomes by means of computational analysis was solved satisfactorily. In contrast, the correct localization of the start codon still appeared to be a problem, as in all cases under test at least 7.8% and up to 32.3% of the positions given in the annotations differed from the locus predicted by any of the programs tested. YACOP can be downloaded from http://www.g2l.bio.uni-goettingen.de

    iBeetle-Base: a database for RNAi phenotypes in the red flour beetle Tribolium castaneum

    The iBeetle-Base (http://ibeetle-base.uni-goettingen.de) makes available annotations of RNAi phenotypes, which were gathered in a large scale RNAi screen in the red flour beetle Tribolium castaneum (iBeetle screen). In addition, it provides access to sequence information and links for all Tribolium castaneum genes. The iBeetle-Base contains the annotations of phenotypes of several thousand genes knocked down during embryonic and metamorphic epidermis and muscle development, in addition to phenotypes linked to oogenesis and stink gland biology. The phenotypes are described according to the EQM (entity, quality, modifier) system using controlled vocabularies and the Tribolium morphological ontology (TrOn). Furthermore, images linked to the respective annotations are provided. The data are searchable either for specific phenotypes using a complex ‘search for morphological defects’ or a ‘quick search’ for gene names and IDs. The red flour beetle Tribolium castaneum has become an important model system for insect functional genetics and is a representative of the most species-rich taxon, the Coleoptera, which comprise several devastating pests. It is used for studying insect-typical development, the evolution of development and for research on metabolism and pest control. Besides Drosophila, Tribolium is the first insect model organism in which large scale unbiased screens have been performed.